The aim of this paper is to review the available Soviet publications on
the topic of parallel data processing published during 1978-1980. (A few
articles published outside of this interval are also included because they
were relevant.) The following is the list of Soviet periodicals, including
their English translations if available, that have been examined for this
review. The abbreviations used in the bibliography are also given here.


ACCS - "Automatic Control and Computer Sciences", by Allerton Press Inc.
Translation of "Avtomatika i Vychislitel'naya Tekhnika" (Russian),
Izdatel'stvo "Zinatne", Riga, USSR.

AT - "Avtomatika i telemekhanika" (Russian), Izdatel'stvo "Nauka". The
English translation is "Automation and Remote Control" by Plenum Publishing
Corporation.

C - "Cybernetics" by Consultants Bureau. Translation of "Kibernetika" (K).

K - "Kibernetika" (Russian), Izdatel'stvo "Naukova dumka", Kiev, USSR.

PCS - "Programming and Computer Software", by Consultants Bureau.
Translation of "Programmirovanie" (Russian).

TK - "Tekhnicheskaya kibernetika" (Technical cybernetics) in the "Izvestia
AN SSSR" series (Russian), Izdatel'stvo "Nauka", Moscow, USSR.
1. Classification of articles. General questions.


It appears difficult to give a good classification of the topic.
Articles [19,24,44,84] contain, in connection with their "state of the art"
exposition, some classifications of the topic. The author of [24] suggests
a classification which we reproduce below. The figure in parentheses is the
number of articles to which we refer concerning the given topic or subtopic.

Methods and means of parallel data processing:

1. General questions. (4 articles).

2. Parallel programming (PP). (38 articles).

3. Parallel computational methods (PCM). (12 articles).

4. Parallel computing systems (PCS). (34 articles).

5. Miscellaneous. (No articles).

Topics 2-5 are subdivided further as follows.

2. Parallel programming;

2.0. General questions. (1 article).

2.1. Schemes and models of parallel programs. (27 articles).

2.2. Languages, methods and systems of PP. (19 articles).

2.3. Operating systems in PP. (7 articles).

2.4. Control, estimation and efficiency of parallel programs. (6 articles).

2.5. Concrete parallel programs. (No articles).

2.6. Miscellaneous. (No articles).

3. Parallel computational methods;

3.0. General questions. (2 articles).

3.1. Complexity of parallel data processing. (No articles).

3.2. PCM of numerical analysis. (6 articles).

3.3. PCM of combinatorial and discrete mathematics. (2 articles).

3.4. PCM of mathematical programming. (2 articles).

3.5. PCM of mathematical statistics, probability theory and information
theory. (No articles).

3.6. Concrete parallel methods and algorithms. (1 article).

3.7. Miscellaneous. (No articles).

4. Parallel computing systems;

4.0. General questions. (1 article).

4.1. Classification of PCS. (1 article).

4.2. SISD systems. (3 articles).

4.3. SIMD systems. (6 articles).

4.4. MISD systems. (5 articles).

4.5. MIMD systems. (13 articles).

4.6. High reliability PCS. (4 articles).

4.7. Real-time PCS. (3 articles).

4.8. Estimation and efficiency of PCS. (15 articles).

4.9. Miscellaneous. (No articles).
This classification is obviously not perfect. Many articles are concerned
with several topics and subtopics at once. But since it covers the largest
variety of topics among the suggested classifications (discussed later), we
will take it as the basis for our classification.

Some of our sections cover several subtopics at once. One of the reasons
for this is that the distribution of Soviet work among the topics listed is
not even. (The survey [24] lists many non-Soviet articles and does not have
a lower chronological limit as ours does.)

The emphasis on theoretical work in this distribution is evident. The
majority of articles are concerned with parallel programming methods,
mostly as theoretical discussion of possible schemes, models, languages,
methods and systems (2.1 and 2.2). Another frequently encountered topic is
the discussion of MIMD computing systems, usually concerned with possible
configurations of such systems and estimation of their efficiency.
Development of concrete parallel programs and computational methods is
discussed in only a small number of the referenced articles. Only one
article [37] was devoted to the problem of parallel program correctness.
These facts could indirectly testify to the low degree of practical
implementation of parallel data processing methods in the USSR. However,
one also has to take into consideration the fact that most of the results
of the practical implementation of these systems are not available even if
published in the USSR. The few exceptions are the two publications [34,35]
of M. A. Kartsev devoted to the Soviet parallel computer M-10, the
reference in [59] to the system "Elbrus" and the reference in [50] to the
system M-6000.

Article [44] is the first part of a review of the state of parallel
language design (section 2.0 in the classification we have accepted). The
author notes that the very notion of parallel computation fragments is
difficult to define. He also notes that "the currently available methods of
formal description of abstract parallel processes have not matured to the
stage of basic mathematical constructs that can be satisfactorily used in
the theory of computational parallelism".
"A simple, intuitive" classification is proposed to "identify certain
typical situations among the rich variety of parallel structures". This
classification follows.

                                   Type of fragment
                         micro            basic            macro

    Level of     high             Nonprocedural languages
    abstraction          Expressions      Statements       Subprograms
                 low              Machine-oriented languages

              The Space of Software Control Structures


Beginning with constructs that are included in parallel FORTRAN and
ALGOL-like language extensions (such as FORK-JOIN, SPLIT-ASSEMBLE,
START-HALT, PARBEGIN-PAREND, COBEGIN-COEND, etc.) and then discussing
semaphores, common resource constructions, and the monitor concept, the
author moves up the right part of the above diagram. All these constructs
are taken from non-Russian work. However, while discussing Hoare's new
suggestions concerning the organization of interacting sequential processes
[28], the author mentions the implementation of these concepts in
experimental projects of homogeneous computer systems (odnorodnie
vychislitel'nie sistemi) of the Novosibirsk group, described in article
[64]. (The paper [64] is not available. The description of the Novosibirsk
project from [44] will be quoted in section 5 of this review.)
While covering the left part of the diagram, the author refers to
(non-Russian) parallel computers with vector, matrix, and associative
processors (STAR-100, ASC, CRAY-1, ILLIAC IV, PEPE, STARAN and others). All
these machines are designed for parallel processing of vector and matrix
data structures, using an essentially synchronous organization of
sequential-parallel computations. [44] refers to article [21] by scientists
of the Ukrainian Institute of Cybernetics in Kiev, which discusses all
aspects of such computation using the concept of synchronous parallel
operators. We will discuss article [21] in section 4 of this review.

Review [84] describes the All-Union Symposium on the Development of
Software for High-Performance Computational Systems, sponsored by the
Ukrainian Institute of Cybernetics. The main topics discussed at the
Symposium were:

1. Theory of Parallel Programming.

2. Theory of Algorithmic Algebras and Parallel Computations.

3. Methods of Constructing Software for High-Performance Computational
Systems.*

4. Models of Parallel Computation and Methods of Realizing Them.

5. Parallel Programming Systems.

6. Software for High-Performance Computational Systems.*

The 1979, No. 1 issue of Kibernetika is exclusively devoted to the
publication of reports of this Symposium.


* The English term "software" is used here to translate two different
Russian terms. In point 3 it means "Mathematical" software (matematicheskoe
obespechenie); in point 6 it means "Programming" software (programmnoe
obespechenie). It seems difficult to state the difference between these two
terms. Probably "Mathematical" software relates more closely to the
theoretical questions of the topic, while "Programming" software is closer
to the implementation level.



Analyzing the spectrum of reports of this Symposium, one can distinguish
two schools of thought concerning parallel data processing: Systems of
Algorithmic Algebras (Russian: sistemi algoritmicheskikh algebr), led by
scientists of the Ukrainian Institute of Cybernetics, and Homogeneous
Computational Systems (Russian: odnorodnie vychislitel'nie sistemi),
developed by the Novosibirsk Institute of Mathematics (N. N. Mirenkov and
others). At this Symposium the representatives of the second school
referred to the computer system MINIMAX as an implementation of their
theories. The first approach is represented in our review by articles
[22,26,62,66], the second by [15,40,41,63,64]. There is also an approach of
uniform computing arrays (Russian: odnorodnie vychislitel'nie strukturi).
It was not represented at this Symposium, although work detailing this
approach was also present among the referenced literature [3,4,29].

The view of V. M. Glushkov, the head of the Institute of Cybernetics of the
Ukrainian Academy of Sciences, on the topics in question is of interest. In
[19] he expresses the following ideas about parallel processing, its
current status, and the program of its development:

"A critical stage in the development of this technology (i.e. that of
parallel programming - B.L.) is the transition from parallel design of
particular computational methods (on a substantive or semisubstantive
level) to formal parallel programming. Note that in most cases linear
sections do not require parallel processing. Indeed, the prospects of
programming technologies on the whole are such that in the very near future
even the longest programs (and hence their linear sections) should not
exceed several million instructions. Given modern computer speeds, a
million instructions are performed in fractions of a second, which is quite
sufficient in most applications... The situation is different with regard
to loops and especially systems of nested loops. Loops are the most
time-consuming elements in programs, and the main efforts of parallel
computation should focus on the elimination of loops. A number of recent
publications have laid the theoretical foundation for formal methods of
parallel processing in programs in general and loops in particular. ...
Current work attempts to use this theoretical foundation to construct
practical automatic programming systems implementing parallel execution."


2. Parallel programming: schemes, models, languages, methods, and systems.


We will not try to impose any strict classification on the subject, instead
simply grouping the articles into several clusters. The authors of [22]
develop the concept of systems of algorithmic algebras (SAA) [18] as a
method of representing schemes of structured parallel programs. According
to the authors, the operations of such an algebra contain the main
constructs of the structured programming suggested in [18]. The SAA,
according to the article, is associated with a model of a multiprocessor
system and is oriented toward formalization of synchronous and asynchronous
parallel computations. On the conceptual level the article considers the
problem of parallel syntactic analysis, which combines bilateral analysis
of text and conveyer (pipeline) computations.

The concept of an SAA is used or mentioned in [26,62,66].

The paper [26] describes some results of the development of a programming
system for the automation of multilevel structured design of parallel
programs. The system is based on the SAA method. It includes several levels
of program translation, beginning with an SAA scheme and finishing with a
concrete parallel program. Each level of the system has to include a
package of service programs for generating and analysing the source.
Algorithms for several programs of the first (most abstract) level, mostly
those of the syntactic analyser, are discussed.
Article [62] presents a detailed example of algorithm analysis using the
SAA method.

Article [66] discusses the problems of conveyer (pipeline) parallel
translation and syntactic analysis for the class of COBOL-like programs,
including bilateral and multilayer processing. The SAA method is used here
as a representation for the programs.

Some articles are devoted to the development and application of the concept
of the A-model of parallel processing [46].

A concept of extensible control structures in parallel programs, analogous
to that of extensible data structures, is developed in [45]. The control
structure of parallel processes, which can be complete chaos as in the
A-model, has to be explicitly described by the programmer in a special
section of his program. Various control structures adapted to a concrete
implementation can be obtained. (Mutual exclusion, sequential control,
parallel fork-type control, control for lattice calculations, etc. are
given as examples of this method.) The author refers to [9,54] as having
introduced ideas with an analogous purpose but a different implementation.
Generally, the method of [45] allows the programmer to build any type of
control that can be expressed by the structure of a specially modified
Petri net model.
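To make "control expressed as a Petri net" concrete, here is a minimal sketch of the mutual exclusion structure mentioned above, written as an ordinary place/transition net. It is our own illustration, not the modified Petri net of [45]; the place and transition names are hypothetical.

```python
from collections import Counter

# Hypothetical mutex net: each transition maps to (input places, output places).
TRANSITIONS = {
    "enter_A": ({"idle_A": 1, "mutex": 1}, {"crit_A": 1}),
    "leave_A": ({"crit_A": 1}, {"idle_A": 1, "mutex": 1}),
    "enter_B": ({"idle_B": 1, "mutex": 1}, {"crit_B": 1}),
    "leave_B": ({"crit_B": 1}, {"idle_B": 1, "mutex": 1}),
}

# One token per idle process plus a single shared "mutex" token.
INITIAL = Counter({"idle_A": 1, "idle_B": 1, "mutex": 1})


def enabled(marking, name):
    """A transition is enabled when every input place holds enough tokens."""
    inputs, _ = TRANSITIONS[name]
    return all(marking[p] >= n for p, n in inputs.items())


def fire(marking, name):
    """Return the marking reached by firing an enabled transition."""
    if not enabled(marking, name):
        raise ValueError(f"{name} is not enabled")
    inputs, outputs = TRANSITIONS[name]
    new = Counter(marking)
    new.subtract(inputs)  # consume input tokens
    new.update(outputs)   # produce output tokens
    return +new           # unary + drops places left with zero tokens
```

Once `enter_A` has fired, the mutex token is gone, so `enter_B` stays disabled until `leave_A` returns it; the control structure is carried entirely by the net's topology.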

In [5] the Petri net model is adopted as a convenient tool for the
synthesis of microprogrammable automata, where the problem of translating
the initial graph-scheme (flow-chart) of the program into an automaton
realization is discussed. The following procedure is presented:

1. Transformation of the initial graph-scheme into the form of a Petri net.

2. Shortening of the derived net, using equivalence transformations.

3. Realization of the shortened net as a finite automaton.

If the size of the net exceeds the allowable size for one automaton, then
the Petri net can be decomposed and realized as a network of automata.
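Step 3 rests on the fact that a bounded Petri net has only finitely many reachable markings, so its reachability graph is itself a finite automaton: states are markings and input symbols are transitions. A minimal sketch of that construction follows, assuming a bounded net given as a dict of (inputs, outputs) pairs; it is not the procedure of [5], which also covers net shortening and decomposition, and the example net is hypothetical.

```python
from collections import deque


def reachability_automaton(transitions, initial):
    """Breadth-first construction of the reachability graph of a bounded net.

    States are frozensets of (place, tokens) pairs; `delta` maps
    (state, transition name) -> next state, i.e. an automaton transition table.
    """
    def freeze(marking):
        return frozenset((p, n) for p, n in marking.items() if n > 0)

    start = freeze(initial)
    delta, seen, queue = {}, {start}, deque([start])
    while queue:
        state = queue.popleft()
        marking = dict(state)
        for name, (inputs, outputs) in transitions.items():
            if all(marking.get(p, 0) >= n for p, n in inputs.items()):
                nxt = dict(marking)
                for p, n in inputs.items():
                    nxt[p] -= n
                for p, n in outputs.items():
                    nxt[p] = nxt.get(p, 0) + n
                target = freeze(nxt)
                delta[(state, name)] = target
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return start, seen, delta


# Hypothetical example: a two-process mutual exclusion net.
MUTEX_NET = {
    "enter_A": ({"idle_A": 1, "mutex": 1}, {"crit_A": 1}),
    "leave_A": ({"crit_A": 1}, {"idle_A": 1, "mutex": 1}),
    "enter_B": ({"idle_B": 1, "mutex": 1}, {"crit_B": 1}),
    "leave_B": ({"crit_B": 1}, {"idle_B": 1, "mutex": 1}),
}
start, states, delta = reachability_automaton(
    MUTEX_NET, {"idle_A": 1, "idle_B": 1, "mutex": 1})
```

For this net the automaton has three states and four transitions, and no state has both processes in their critical sections. The state count grows quickly with net size, which is consistent with the need for decomposition noted in [5].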

The authors of [30] analyse the implementation of the trigger functions
(spuskovye funktsii) of the A-model of parallel programs. It is shown that
the time required for dynamic control of the trigger function is
proportional to the length of the program. Ways of reducing this time are
investigated in the paper. Some static precalculations that determine
whether statements are ready for execution are suggested. The idea of these
precalculations is similar to that presented in another article by these
authors, [29].

The dynamic control of parallel processing in terms of the A-model is part
of the automatic parallelization problem and corresponds to dynamic
parallelization, in which the order of concurrent execution is defined at
execution time.
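The cost noted in [30] shows up even in a toy model where a statement fires once all statements it depends on have finished. Rechecking every statement's trigger after each step costs time proportional to the program length, while a static precalculation of "who waits on whom" lets the scheduler touch only the affected statements. The sketch below contrasts the two. The dependency-list representation is our own assumption and is far simpler than the A-model's trigger functions, and the program is assumed acyclic.

```python
from collections import defaultdict


def run_rescanning(deps):
    """Dynamic control: after each step, test every statement's trigger."""
    done, order, checks = set(), [], 0
    while len(done) < len(deps):  # assumes an acyclic dependency graph
        ready = []
        for stmt, pre in deps.items():  # full scan: O(program length)
            checks += 1
            if stmt not in done and all(p in done for p in pre):
                ready.append(stmt)
        done.update(ready)
        order.append(sorted(ready))
    return order, checks


def run_precomputed(deps):
    """Static precalculation: waiting counters plus successor lists."""
    waiting = {s: len(pre) for s, pre in deps.items()}
    successors = defaultdict(list)
    for stmt, pre in deps.items():
        for p in pre:
            successors[p].append(stmt)
    ready = sorted(s for s, n in waiting.items() if n == 0)
    order, checks = [], 0
    while ready:
        order.append(ready)
        nxt = []
        for stmt in ready:
            for succ in successors[stmt]:  # touch only affected statements
                checks += 1
                waiting[succ] -= 1
                if waiting[succ] == 0:
                    nxt.append(succ)
        ready = sorted(nxt)
    return order, checks


# Hypothetical program: statement -> statements it must wait for.
PROGRAM = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["a"]}
```

Both schedulers produce the same groups of concurrently executable statements. Only the amount of trigger checking differs: 15 checks for rescanning against 4 for the precomputed version on this five-statement program.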

Article [29] discusses another mode of parallelization, namely static
parallelization. The method of automatic static parallelization discussed
uses an information-logic graph as the model of a program. The program
statements that can be executed in parallel in any execution are
recognized. Other operators are allocated to special constructions called
branches. Conditions for the feasibility of parallel execution of the
branches are given. This method of parallelization is oriented toward
uniform computer arrays. An estimate of the complexity of such a
realization is given.

Besides the work mentioned above, the topic of "automatic parallelization"
is considered in [36,37], where the authors suggest a technique for
automatic parallelization of program structures such as arithmetic
expressions or special kinds of program schemata that form a hierarchy. For
the special case of arithmetic expressions the algorithm is compared with
that of [71].
It may be of interest to quote here one of the Soviet "practitioners",
M. A. Kartsev, on automatic parallelization.

While discussing the parallel computer M-10 in [35] (we will discuss it in
section 4 of the review), he says: "Let the program (which one wants to
parallelize - B.L.) be written in ALGOL-60 or FORTRAN without any
additional indications (i.e. those added to it to express parallelism
explicitly - B.L.), and let it be required to realize the parallelism of
independent branches. Then the translator (compiler - B.L.) has to examine
the possibility of parallel execution of all program sections of the same
level (blocks of ALGOL or subroutines of FORTRAN). It is not very difficult
to formalize the rules according to which the translator could reveal a
possibility of executing a pair of the sections in parallel. However, the
work required of the translator for large programs is tremendous, as it
necessitates checking all pairs... That is why the programmer is required
to indicate the parallelism of independent branches explicitly, using some
additional means introduced into the language... Formal rules of action for
the translator in this case (of loop parallelization - B.L.) are of the
same complexity as when the translator reveals the possibility of parallel
execution of two sections of the program when using branch parallelism. But
the job of the translator is essentially easier, as each loop is
investigated separately (with possible nested loops) and it is not required
to consider several loops jointly at once."
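Kartsev's cost argument can be made concrete. One common formalization of "two sections may run in parallel" is Bernstein's conditions (our choice; [35] does not name them): neither section writes a variable that the other reads or writes. Applying such a test to every pair of n sections takes n(n-1)/2 checks, the quadratic growth Kartsev objects to. The sketch assumes the read and write sets of each section are already known, which is itself nontrivial for a compiler.

```python
from itertools import combinations


def independent(s1, s2):
    """Bernstein's conditions on (reads, writes) pairs of variable sets."""
    r1, w1 = s1
    r2, w2 = s2
    return not (w1 & r2) and not (r1 & w2) and not (w1 & w2)


def parallel_pairs(sections):
    """Test every pair of sections; return independent pairs and test count."""
    pairs, tests = [], 0
    for (n1, s1), (n2, s2) in combinations(sections.items(), 2):
        tests += 1
        if independent(s1, s2):
            pairs.append((n1, n2))
    return pairs, tests


# Hypothetical sections: name -> (variables read, variables written).
SECTIONS = {
    "P1": ({"x"}, {"y"}),
    "P2": ({"x"}, {"z"}),
    "P3": ({"y"}, {"w"}),
}
```

Here P1 and P3 conflict because P3 reads `y`, which P1 writes, while the other two pairs are independent. With 100 sections the translator would already need 4950 pairwise tests.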

The related problem of loop parallelization is considered in a number of
articles [55,56,65,67,76,79].

Articles [56,65] discuss a method for generating sets of loop-parameter
values for the parallel execution of nested loops, such that the
computations for each set can be performed in parallel while the sets
themselves are processed sequentially. Essentially, the hyperplane method
of Lamport [53] is employed in this work.
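For a loop nest such as A[i][j] = A[i-1][j] + A[i][j-1], each iteration depends on its neighbours above and to the left, so neither loop can simply run in parallel. Grouping the iterations by t = i + j gives sets (hyperplanes) whose points are mutually independent. Each set can then be computed in parallel while the sets are processed in increasing t. A minimal sketch, assuming this particular recurrence and the time function t = i + j; in general the hyperplane method has to find a suitable linear time function.

```python
def sequential(n, m):
    """Original loop nest; row 0 and column 0 hold boundary values."""
    a = [[1] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def hyperplanes(n, m):
    """Yield, for t = 2, 3, ..., n + m - 2, the iterations (i, j) with i + j = t."""
    for t in range(2, n + m - 1):
        yield [(i, t - i) for i in range(max(1, t - m + 1), min(n - 1, t - 1) + 1)]


def wavefront(n, m):
    a = [[1] * m for _ in range(n)]
    for plane in hyperplanes(n, m):
        # Every point in `plane` reads only values from earlier planes,
        # so these updates could all be performed in parallel.
        updates = {(i, j): a[i - 1][j] + a[i][j - 1] for i, j in plane}
        for (i, j), v in updates.items():
            a[i][j] = v
    return a
```

For a 5 x 6 array the 20 interior iterations fall into 8 hyperplanes, so with enough processors the nest takes 8 parallel steps instead of 20 sequential ones, and the result is identical to the sequential loop.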
Article [76] apparently presents similar results.

Article [67] considers the problem of parallelization taking into account
array structures of data.

Article [55] was not available, but it is referred to in [65], where the
authors describe the following static parallelization procedure suggested
in [55]: "The first step of the algorithm involves the isolation in the
program of constructs such as linear segments and simple loops, which
represent individual vertices in the generalized graph of control links.
(The graph is a variant of a flow-chart considered by the author. The
problem is to construct this graph, given the program code in a high-level
language, and to parallelize it - B.L.) The next stage in the procedure is
the distribution over local levels of the elementary statements that make
up the body of a simple loop or that appear in an elementary segment, so as
to minimize the execution time for each of these program constructions.
The next stage is to isolate and parallelize complex loops such as the
nested DO loops of FORTRAN. The procedure for transforming such loops to a
form convenient for parallel execution is based on determining, for each DO
loop, the upper and lower limits, these being functions of the variables of
these loops, and checking the existence of a solution of a system of
Diophantine equations that relate the variables of the loops in question.
In the next stage of the algorithm we restructure the generalized graph of
control links between program statements, making it possible to execute
individual generalized statements not in the order in which they occur in
the initial program, but in an order which can speed up the execution of
the program as a whole. This is achieved by constructing a dominant graph
of the program, isolating ordered linear sequences in it, and combining
them into independent solution branches of [53]."
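In the simplest case, the dependence check mentioned in the quotation reduces to asking whether a linear Diophantine equation has an integer solution. If one iteration writes X(a*i + b) and another reads X(c*j + d), the two references can touch the same element only if a*i - c*j = d - b has an integer solution, which requires gcd(a, c) to divide d - b. The sketch below is this standard "GCD test", given as our own illustration and not necessarily the procedure of [55]. Because it ignores the loop limits, it can prove independence but never dependence.

```python
from math import gcd


def may_depend(a, b, c, d):
    """GCD test for array references X(a*i + b) and X(c*j + d).

    False: a*i - c*j = d - b has no integer solution, so the references
    never coincide. True only means "maybe": loop limits are ignored.
    """
    g = gcd(a, c)
    if g == 0:  # both subscripts are constants
        return b == d
    return (d - b) % g == 0
```

For example, writing X(2i) and reading X(2j+1) can never conflict, so such a loop is safe to parallelize. Writing X(2i) and reading X(4j+2) may conflict and needs a finer test that uses the loop limits.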

In [79] a practical method of parallelization is described. A parallel
language at two levels, abstract and concrete, a loop parallelization
algorithm, parallelization of procedures with formal parameters, and
automatic branch generation and branch merging mechanisms are presented.
The author himself characterizes these results as "fast and robust", in
contrast with theories that have not yet matured to the stage of practical
implementation. These methods are oriented toward systems with shared
memory where the number of processing elements is relatively small. The
author refers to the implementation of the proposed system as a program
that transforms programs written in FORTRAN IV with additional parallel
features (PARDO, SYNCH, FORK-JOIN). According to [79], there is no
limitation in principle on the object code language. The algorithm was
programmed on the BESM-6 at the Novosibirsk branch of ITM and VT (this
probably stands for Institut Tochnoy Mekhaniki i Vychislitel'noy Tekhniki,
the Institute of Precise Mechanics and Computer Technology).

In [63] the issue of the comparative power of synchronization primitives is
studied. One system of primitives is more powerful than another if the
first system can model the second. The paper introduces a definition of the
notion "one system models the other" which is more general than that in
[57]. It introduces "service semaphores" that represent "service breaks"
(sluzhebnye ostanovy). Such breaks may exist in real implementations of a
synchronization system and can cause effects that are ignored by the
definition in [57]. In the absence of such service semaphores, the notion
"to model" of the paper coincides with that of [57].
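The idea of one system of primitives "modeling" another can be illustrated informally with a familiar case: a counting semaphore built from a lock and a condition variable, so that locks plus conditions model semaphores. This is only an intuitive example using Python's standard `threading` module. It does not follow the formal definitions of [63] or [57], and the class name is hypothetical.

```python
import threading


class ModeledSemaphore:
    """Dijkstra's P and V operations expressed with a lock-based condition."""

    def __init__(self, value=0):
        if value < 0:
            raise ValueError("initial value must be non-negative")
        self._value = value
        self._cond = threading.Condition(threading.Lock())

    def P(self):
        """Wait until the counter is positive, then decrement it."""
        with self._cond:
            while self._value == 0:  # loop guards against spurious wakeups
                self._cond.wait()
            self._value -= 1

    def V(self):
        """Increment the counter and wake one waiting thread, if any."""
        with self._cond:
            self._value += 1
            self._cond.notify()
```

Every behaviour of P and V is reproduced by the lower-level primitives, which is the informal sense of "models". The formal question studied in [63] is when such a simulation holds, and what happens when the implementation introduces extra pauses of its own.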

The decidability of the correctness problem for graph schemes (dataflow) of
parallel algorithms is proved in [39]. Correctness is defined as the
impossibility of any operator being activated (i.e. receiving all its
inputs) while another execution is taking place.

The problem of relevant descriptions of parallel programs, either in itself
or in connection with a study of the properties of parallel programs, is
discussed in [6,32,50,51,52,67,76,78].

The most commonly used approach is to refine the usual flow-chart symbolism
for parallel programs. Such a refinement is suggested in [52]. Further
refinements [50,51] lead to an approach based on a class of well-developed
programming languages of recursive functions. According to the authors of
[50,51], this helps to analyse the properties of programs at the functional
level. In [50] the authors refer to a "graphical version of the parallel
algorithms language" that is based on this theory and is implemented on the
M-6000 system. In [51] a model of asynchronous computation of the values of
functions specified in the languages of functional schemes is studied. It
is demonstrated that the model under consideration is correct.

Another refinement of the flow-chart notion is the operator schema. In [77]
the author investigates the possibility of automatically transforming
operator schemata into an asynchronous form which admits leading
calculation (e.g. prefetching of data). The author is especially concerned
with preventing "uneconomical" leading, by which he means advancing parts
of a calculation whose results cannot be used under any circumstances in
the future. A special operator schema is introduced for this purpose.

In [78] the model of operator schemata is used to discuss the maximal
parallelism problem.

In [68] the problem of program equivalence is discussed using the operator
schemata approach.

An attempt to study the problem of parallel decomposition of algorithms on
the basis of the theory of normal algorithms is made in [6]. The problem of
parallel decomposition is treated as a problem of decomposing a normal
algorithm into a parallel form of composition (union).

In [32] the idea of the information structure of programs is introduced. It
is defined as a set of restrictions on the mechanisms that form information
connections in the program. In a narrow sense it is defined as a
restriction on the mechanisms for selecting the predecessors of operators
within the graph.
Papers [2,69,79] contain some practical suggestions on parallel
programming, languages and systems organization.

In [2] some new language constructs for common-memory computers are
suggested. According to the authors, these constructs are well adapted to
mathematical optimization problems, which are characterized by the presence
of large logically closed pieces. Parallelization is treated as external to
these pieces. The computational units ("processors") are explicitly
represented in the language. Each "processor" is explicitly connected with
several others and exchanges information with them. The language is based
on ALGOL-68.

A practical procedure of parallel syntactic analysis is discussed in [69].

Lastly, we mention articles [33,83,86], which, in the context of some other
question (for example that of efficient parallel computational methods as
in [83,86], or machine-level organization of parallel computation),
demonstrate a tendency to introduce and make use of a low-level
representation of operands. However, nothing like a theory of "low-level
parallelization" was found in the reviewed articles.


3. Parallel computational methods.

Articles [7,10,11,12,20,27,31,43,62,70,83,86] discuss various aspects of
the development of parallel computational methods.

In [7,43] problems of large size are considered. These require the
cooperative efforts of many computer installations and a great deal of
computation.
In [7] the interproduct balance problem that arises in planned economies is
discussed. Mathematically, it is a problem of solving a system of linear
algebraic equations of very high dimensionality (according to the author,
up to 10^12 bytes of memory are required to represent these equations). An
algorithm of iterative aggregation is adopted for its solution. It works as
follows: a central processor submits to the branches (subsidiary
processors) a plan including the resources available to each branch.
Because of the high dimensionality, the information is represented in
aggregated form. The branches, using detailed databases, compute the
productivity they can provide with the given resources. The central
processor corrects the aggregated plan and submits it to the branches
again, and so on. The algorithm is intended to be performed in parallel on
a network of ring-connected computers, each of which represents a branch or
the center. It is expressed in SAA notation (as in [18,22]) and uses
collector (from all to one), translational (from one to all) and
translational-cyclic (from each i to f(i), where f is a permutation)
exchange of information between the computers. An evaluation of the
performance time for the ES-1020 computer is discussed.
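The three exchange patterns named in [7] can be stated precisely on a network of k computers. In the sketch below each list entry stands for one computer's outgoing value. "Collector" gathers all values at one node, "translational" broadcasts one node's value to every node, and "translational-cyclic" delivers node i's value to node f(i) for a permutation f. The function names and the list model are our own assumptions for illustration, not SAA notation.

```python
def collector(values, root=0):
    """From all to one: node `root` receives every node's value."""
    inbox = [[] for _ in values]
    inbox[root] = list(values)
    return inbox


def translational(values, root=0):
    """From one to all: every node receives the value held by `root`."""
    return [values[root] for _ in values]


def translational_cyclic(values, f):
    """From each i to f(i): node f(i) receives node i's value."""
    k = len(values)
    targets = [f(i) for i in range(k)]
    if sorted(targets) != list(range(k)):
        raise ValueError("f must be a permutation of the nodes")
    received = [None] * k
    for i, t in enumerate(targets):
        received[t] = values[i]
    return received
```

With the center at node 0 and three branches, one round of iterative aggregation would broadcast the plan (translational) and gather the branch results (collector). A ring shift f(i) = (i + 1) mod k is the translational-cyclic exchange that matches the ring connection.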

In [43] the problem of analysis and calculation of complex systems using
multi-computer complexes is considered in general. The diakoptics method of
G. Kron [48,72] is applied to distribute such problems among the computers.

Articles [11,31] consider some "grid" parallel processors for the solution
of mathematical physics problems.

In [11], after the presentation of the "grid" scheme for solving a problem using a specialized logical processor network, methods based on stability theory are demonstrated for investigating the stability of transient processes in this network. Ways of ensuring the stability of transient processes are considered for synchronous and asynchronous logical networks.
In [31] a solution of the difference Poisson equation is suggested and analyzed. The PEs are connected as a planar grid. This special processor is intended to be attached to a general-purpose computer. Speed estimates indicate a performance gain of several orders of magnitude over the STARAN IV system on the same problem.
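The kind of computation such a grid processor performs can be sketched with Jacobi iteration for the five-point difference Poisson equation. The choice of Jacobi iteration is ours (the review does not say which scheme [31] uses); the point is that each grid node is updated only from its four neighbours' previous values, so one PE per node could perform all updates of a sweep at once. A square grid with zero boundary values is assumed.

```python
def jacobi_poisson(f, h, sweeps):
    """Approximate -(Laplacian u) = f on an n x n grid with u = 0 on the boundary.

    Five-point scheme: 4*u[i][j] - (sum of 4 neighbours) = h*h*f[i][j].
    Each sweep computes every interior value from the PREVIOUS sweep only,
    so all interior points are independent within a sweep."""
    n = len(f)
    u = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        new = [row[:] for row in u]          # boundary rows/columns stay 0
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                neighbours = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                new[i][j] = 0.25 * (neighbours + h * h * f[i][j])
        u = new
    return u


f = [[1.0] * 5 for _ in range(5)]            # unit load on a 5 x 5 grid
u = jacobi_poisson(f, h=0.25, sweeps=200)
print(u[2][2])
```

The two nested loops are written sequentially here, but no iteration of them reads a value written in the same sweep, which is exactly what makes the sweep suitable for a grid of PEs.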

In  [12]  qualitative  arguments  are  presented  that  serve  to 
determine  whether  it  is  possible  to  parallelize  a  computation  of  a 
physical  problem.  The  performance  of  the  method  is  demonstrated  for  a 
problem  of  subsonic  stationary  flow  around  a  cylinder. 

The  problem  considered  in  [10]  is  that  of  optimal  automaton 
design.  It  is  shown  that  it  can  be  considered  as  an  integer  linear 
programming  problem  with  Boolean  variables.  The  latter  problem  is 
reduced  to  a  distributive  network  problem.  This  makes  it  possible  to 
employ  multiprocessing. 

In  [70]  parallelization  of  computation  of  a  polynomial  of  one 
variable  is  considered. 
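The review does not describe the scheme of [70]. As a point of reference, the well-known Estrin scheme shows one way polynomial evaluation can be parallelized: adjacent coefficients are combined pairwise, and all pairs in a round are independent. Horner's rule, by contrast, is a chain of n dependent multiply-adds. The sketch below is Estrin's scheme, not necessarily the method of [70].

```python
def estrin(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + ... + coeffs[n]*x**n (Estrin's scheme).

    In every round the pairs (c0 + c1*y), (c2 + c3*y), ... are independent of
    one another, so they could be computed on separate processors; the number
    of rounds is about log2(n + 1) instead of Horner's n dependent steps."""
    terms = list(coeffs)
    if not terms:
        return 0
    y = x                                   # current power x**(2**round)
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0)                 # pad to an even number of terms
        # independent pairwise combinations (parallel in principle)
        terms = [terms[k] + terms[k + 1] * y for k in range(0, len(terms), 2)]
        y = y * y
    return terms[0]


print(estrin([1, 2, 3], 2))                 # 17
```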

In  [83]  vectors  and  matrices  are  represented  as  elements  of 
special  algebras  which  in  turn  are  mapped  into  real  numbers  so  that 
matrix  operations  are  mapped  into  the  usual  operations  on  real  numbers. 

In [86] a tridiagonal linear system is to be solved by the pivotal condensation method. Although the method is not parallel, the authors suggest a "quasi-parallel" method. It uses several operating units (OU) that perform computations in a combined mode, which enables the next OU to compute with the digits that have just been generated by the previous one. (All OUs are arranged in a ring, and the size of the ring must be at least a constant estimated in the paper.) To exploit this idea the authors are developing a special logical structure for each OU that enables it to begin calculations after being fed only part of the digits of the input numbers.
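Why the method is sequential can be seen from the standard elimination for a tridiagonal system (the Thomas algorithm), sketched below: every forward step uses the result of the previous one. This is only an illustration of the number-level dependency chain that the digit-level overlap of [86] is meant to shorten; the authors' OU design is not modelled, and the function name is ours.

```python
def solve_tridiagonal(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] (a[0], c[-1] unused)
    by Gaussian elimination without pivoting (Thomas algorithm)."""
    n = len(b)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0] if n > 1 else 0.0
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]          # needs step i-1: a strict chain
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):           # back substitution, also a chain
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


print(solve_tridiagonal([0, -1, -1], [2, 2, 2], [-1, -1, 0], [1, 0, 1]))
```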
4. SISD, SIMD, MISD, MIMD systems, high-reliability
and real-time systems.


The number of articles discussing these parallel processing systems in some context is relatively large (see [1,3,4,5,11,13,14,15,20,21,25,29,31,33,34,35,36,37,38,40,41,42,49,58,59,60,73,74,75,77,80,81,82,85]). However, most of them discuss the problem only at a very theoretical level, or represent the system very schematically for the analysis of information streams, efficiency, etc.

In this section we consider only work that gives a more detailed and refined discussion of the practical implementation of parallel computer systems. Other work is discussed in the sections of the review to which it is more closely related.

In [34,35] a short description of the M-10 computer is given. This machine is organized to exploit what the author calls "natural parallelism" (close to the SIMD principle) as well as "parallelism of adjacent operations" (close to the MISD principle). Provision is also made for connecting small numbers of such machines together; such complexes could realize "independent branch parallelism" (close to the MIMD principle). We give a more detailed description of this computer, as articles [34,35] are not yet available in English.

The main processor of the machine consists of two 128-bit rows. Depending on the instruction being executed, each row acts as eight 16-bit processors, four 32-bit processors, or two 64-bit processors executing the same operation (but generally on distinct data). During the same cycle the other row can execute the same operation or a different one. Some instructions make both rows act as a single vector processor. For example, when computing a scalar product (which is included in the machine instruction set), eight pairs of 16-bit or four pairs of 32-bit numbers are multiplied during one machine cycle, and the partial products are added to each other and to the sum accumulated in the previous cycle. Along with the computations in the two main rows, the computer generates up to five strings of logical variables, each bit of which corresponds to some flag (overflow, equality of operands, etc.). A special processor works on these strings simultaneously with the processors of the main rows. These logical strings can be used in the branching and/or masking mechanisms of program flow control.

The  internal  memory  of  the  computer  is  5  megabytes,  and  the 
average  speed  is  about  5  million  operations  per  second. 
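The lane arithmetic of the scalar-product instruction can be made concrete with a minimal sketch of one cycle. It assumes that one 128-bit row holds the first operand vector and the other row the second, that lanes are unsigned, and that the accumulator is unbounded; the function names are hypothetical, and the M-10's actual number formats, flags and accumulator width are not modelled.

```python
ROW_BITS = 128


def split_row(row, width):
    """Split a 128-bit row (a Python int) into ROW_BITS // width unsigned lanes."""
    mask = (1 << width) - 1
    return [(row >> (k * width)) & mask for k in range(ROW_BITS // width)]


def pack_row(lanes, width):
    """Inverse of split_row: place lane k at bit offset k * width."""
    if len(lanes) != ROW_BITS // width:
        raise ValueError("wrong number of lanes for this width")
    row = 0
    for k, value in enumerate(lanes):
        row |= (value & ((1 << width) - 1)) << (k * width)
    return row


def scalar_product_cycle(row_x, row_y, width, acc=0):
    """One cycle: multiply the lane pairs of the two rows and add all partial
    products to each other and to the previously accumulated sum."""
    xs = split_row(row_x, width)
    ys = split_row(row_y, width)
    return acc + sum(x * y for x, y in zip(xs, ys))


# eight pairs of 16-bit numbers in one cycle
x = pack_row([1, 2, 3, 4, 5, 6, 7, 8], 16)
y = pack_row([1] * 8, 16)
print(scalar_product_cycle(x, y, 16))        # 36
```

A longer scalar product would call `scalar_product_cycle` repeatedly, passing the previous result as `acc`, which mirrors the accumulation across cycles described above.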

Article [40] discusses the organization of the elementary machine (EM) of a Homogeneous Computational System (see section 1), the organization of intermachine exchange, and the introduction of language facilities that make it possible to specify the order of operator execution and to program hardware-software interfaces that implement interregister transfers both within one EM and among different EMs. The system is described in great detail, including the algorithmic implementation of various parts of this design (semaphores, generation, cancellation, interaction of operators, etc.).

Three versions of a multi-functional cell in a rearrangeable multi-purpose uniform rectangular computing array are presented in [4] at the logical-component level of design. The functioning of the cells is modelled.

In [3] an algorithm for global redundancy of digital units in uniform computing arrays (see section 1) with single faults is presented. The algorithm uses only half as many redundant cells as conventional methods.

In [14] the authors consider a system that consists of m PEs. The computational problem to be solved has a given computational complexity and is divided into several stages. Each stage is solved independently on every PE; then the results are compared. The subsequent actions depend on the result of the comparison and on the given strategy. The average time for performing one stage is studied analytically.

A refinement of the SIMD computational architecture is considered in [73] for the class of problems that require on-line processing of a stream of data coming from a number (10^2 to 10^3) of sensing devices. A specific feature of such problems is that different coordinates of the processed vector (i.e., scalar streams from different sensing devices) may require distinct processing algorithms. The author suggests introducing what he calls "characteristic vectors", whose binary components correspond to the individual scalar computations. Such vectors can be produced during the computation by a special command, or may be formed as constants before program execution. In a special tracing mode such a vector controls execution so that a given command is performed on a given component of a data vector only if the corresponding coordinate of the characteristic vector is 1. The possibility of implementing various branching constructs (IF...DO, IF...THEN...ELSE, WHILE...DO, REPEAT...UNTIL) using characteristic vectors and stack mechanisms is also discussed for structured programs (no blocks intersect, loops have only one entrance and one exit, etc.).
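A minimal sketch of the characteristic-vector idea, assuming that a mask bit of 1 enables a component and that a stack of saved masks handles nested IF...THEN...ELSE. The class and method names are hypothetical, and the loop constructs (WHILE...DO, REPEAT...UNTIL) are omitted for brevity.

```python
class MaskedVectorUnit:
    """Toy SIMD unit: a characteristic vector (one bit per component) decides
    which components an operation actually updates."""

    def __init__(self, n):
        self.mask = [1] * n        # current characteristic vector
        self.saved = []            # stack of enclosing masks for nested IFs

    def apply(self, op, data):
        """Perform op on component k only where mask[k] == 1."""
        return [op(x) if m else x for x, m in zip(data, self.mask)]

    def if_(self, cond):
        """Enter IF: keep only active components whose condition holds."""
        self.saved.append(self.mask)
        self.mask = [m & int(bool(c)) for m, c in zip(self.mask, cond)]

    def else_(self):
        """Switch to ELSE: active in the enclosing mask but not in the THEN mask."""
        outer = self.saved[-1]
        self.mask = [o & (1 - m) for o, m in zip(outer, self.mask)]

    def end_if(self):
        """Leave the construct: restore the enclosing characteristic vector."""
        self.mask = self.saved.pop()


# IF x < 0 THEN x := -x ELSE x := 10 * x, on all components at once
unit = MaskedVectorUnit(4)
data = [3, -1, 4, -5]
unit.if_([x < 0 for x in data])
data = unit.apply(lambda x: -x, data)
unit.else_()
data = unit.apply(lambda x: 10 * x, data)
unit.end_if()
print(data)                                  # [30, 1, 40, 5]
```

Both branches are issued to every component; the characteristic vector only decides where each one takes effect, which is why the ELSE mask must be derived from the saved enclosing mask rather than from the possibly modified data.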

In [21] the authors give a general representation of computations in SIMD machines using the method of "periodically defined functions". Periodically defined functions are functions defined on a set of finite data structures that is closed under the basic operations on structure elements and under operations that change the data structure (shifts, superpositions, cuts). These functions are obtained by solving a system of equations (for the unknown functions $f_1,\ldots,f_n$) of the form

$y_i = \varphi_i(y_1,\ldots,y_n,x_1,\ldots,x_m), \quad i = 1,\ldots,n,$

where $x_1,\ldots,x_m$ are the variables or data structures, $y_i = f_i(x_1,\ldots,x_m)$ are the unknown functions, and $\varphi_i(z_1,\ldots,z_{n+m})$ are known functions on the set of data structures. One can define a partial order relation =< on the set of data structures by setting s =< t if s is undefined or s and t coincide. This relation is extended in the usual way to functions on the set of data structures. Then, using the known fixed-point theorems for partially ordered sets, one can show that the system has a solution under certain restrictions on the right-hand sides of the equations. These restrictions include regularity of the functions $\varphi_i(z_1,\ldots,z_{n+m})$, i.e., every function can be represented by an expression in which every sum consists of terms whose codomains are disjoint for any combination of the arguments. Introducing periodically defined functions into programming languages makes it possible to implement parallel synchronous computations on data structures. An example is the following generalization of the assignment operator, which makes it possible to describe periodically defined transformations. The operator

a[i,j] := a[i,j] - (a[i,j]/a[k,k]) * a[k,j],   where k =< i,j =< n;

describes a group operation on the matrix a[1:n,1:n] for some k, 1 =< k =< n. All the matrix elements with the coordinates specified in the operator (i.e., k =< i,j =< n) are computed simultaneously.
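The essential property of such a group operator is that all target elements are computed from the values held before the operator starts. A minimal sketch (helper names hypothetical, 0-based indices) makes the difference from an ordinary element-by-element loop visible with a simple row shift a[i,j] := a[i-1,j]: evaluated simultaneously it shifts the rows down, whereas a naive in-place loop copies the first row into every row.

```python
import copy


def group_assign(a, rows, cols, update):
    """Group (synchronous) assignment on a matrix stored as a list of rows.

    Every a[i][j] with i in rows and j in cols is recomputed by
    update(old, i, j), where `old` is a snapshot taken BEFORE the operator
    starts, so the result is the same as if all elements were computed
    simultaneously, whatever the visiting order."""
    old = copy.deepcopy(a)
    for i in rows:
        for j in cols:
            a[i][j] = update(old, i, j)
    return a


def naive_assign(a, rows, cols, update):
    """Ordinary element-by-element loop, for comparison: later elements see
    values already overwritten earlier in the same loop."""
    for i in rows:
        for j in cols:
            a[i][j] = update(a, i, j)
    return a


def shift_down(m, i, j):
    """a[i,j] := a[i-1,j]"""
    return m[i - 1][j]


print(group_assign([[1, 2], [3, 4], [5, 6]], range(1, 3), range(2), shift_down))
# [[1, 2], [1, 2], [3, 4]]
print(naive_assign([[1, 2], [3, 4], [5, 6]], range(1, 3), range(2), shift_down))
# [[1, 2], [1, 2], [1, 2]]
```

The snapshot is the simplest way to obtain simultaneous semantics on a sequential machine; a SIMD implementation would instead let every PE read its operands before any PE writes.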

According to [44], the theory of periodically defined functions constitutes a promising foundation for developing collections of parallel group macro-operations on regular data structures, with efficient implementation of these operations on various configurations of parallel processors.

The problem considered in [41] is to develop an intercomputer communication network suitable for Homogeneous Computational Systems. The authors first informally list conditions that must hold for such systems and the corresponding graphs. Among these is the ability to change the number of computers in the system without essential changes to the network (only local changes are admissible). Other conditions are a minimal distance between any two computers and a limit on the number of local connections to each computer in the network. The authors then develop a procedure for generating a family of graphs satisfying constraints that express the above properties. Among these are constraints on the degree and girth of the graphs, and minimality of the diameter and mean diameter.
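The graph characteristics used as constraints in [41] are easy to compute for a candidate network. The sketch below assumes an undirected graph given as a dictionary of neighbour sets and takes "mean diameter" to mean the mean shortest-path distance over all ordered pairs of distinct nodes (our interpretation); it is a checking tool, not the generation procedure of [41].

```python
from collections import deque


def bfs_distances(adj, src):
    """Shortest-path distances (in hops) from src in an unweighted graph."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def graph_metrics(adj):
    """(max degree, diameter, mean distance) of a connected undirected graph
    given as {node: set_of_neighbours}."""
    total = pairs = diameter = 0
    for u in adj:
        dist = bfs_distances(adj, u)
        if len(dist) != len(adj):
            raise ValueError("graph is not connected")
        for v, d in dist.items():
            if v != u:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    mean = total / pairs if pairs else 0.0
    return max(len(nbrs) for nbrs in adj.values()), diameter, mean


def girth(adj):
    """Length of the shortest cycle (inf for a forest): in a BFS from every
    node, a non-tree edge (u, v) closes a cycle of length dist[u] + dist[v] + 1
    or less, and the minimum over all roots is exact."""
    best = float("inf")
    for root in adj:
        dist, parent = {root: 0}, {root: None}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v], parent[v] = dist[u] + 1, u
                    queue.append(v)
                elif parent[u] != v:
                    best = min(best, dist[u] + dist[v] + 1)
    return best


ring6 = {v: {(v - 1) % 6, (v + 1) % 6} for v in range(6)}
print(graph_metrics(ring6), girth(ring6))
```

Running these functions after each local change to a network (adding or removing a computer) shows directly whether the degree, girth and diameter limits are still met.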

In  [20]  the  authors  view  the  design  theory  of  multiprocessor 
computer  hardware  and  software  as  a  refinement  or  special  branch  of  the 
theory  of  digital  system  design  whose  foundations  have  been  set  forth 
in  [18].  The  main  task  of  this  theory  is  the  development  of  a  method 
of  formalized  technical  tasks  as  applied  to  the  design  of  computer 
hardware  and  software.  This  point  of  view  defines  the  scope  of  the 
questions  discussed  in  the  paper.  It  contains  the  basic  mathematical 
models  proposed  for  the  description  of  multiprocessor  systems,  the 
methods  of  implementing  parallel  processing  in  these  systems,  and  a 
review  of  the  principal  directions  in  the  development  of  a  general 
methodology  of  multiprocessor  system  design.  According  to  the  authors 
one  of  the  central  ideas  of  their  work  is  that  hardware  design  and 
program  development  for  multiprocessor  systems  have  a  common 
theoretical  and  methodological  basis. 


5. Scheduling,  control,  estimation  and  efficiency 
of  parallel  programs  and  systems.  Operating  systems. 


In articles [8,16,17] the following scheduling problem is studied: m processes have n critical sections each. Each critical section may be entered by only one process at any given time. The processes' requests to enter the critical sections are assumed to be random and, in the absence of conflicts, describable by a Markov chain. The problem is to decide which process should have priority in case of conflict. (The criterion to be minimized is the mean frequency of conflicts.)
In [16] the case m=2, n arbitrary is considered. The answer: for each of the n critical sections, assign priority to that one of the two processes which is less likely to enter the conflict-generating state.

In [8] the case m arbitrary, n=1 is considered. The result is formulated in the same way.

In [17] the most general case, m and n arbitrary, is considered. The optimal resolution rule looks the same, but only the case of weakly connected processes is treated, i.e., the probabilities of entering the critical sections are sufficiently small.

The conveyor (pipeline) system for concurrent command execution is studied using a Markov model in [38]. The existence of several condition flag schemes (CFS) is taken into consideration. If several CFS are used, the parallelizability of the command stream increases, as do the distances between dependent commands. The authors point out that previous work on the productivity of pipeline models never considered more than a single CFS.

In [82] the following model is considered. The system consists of n processing elements (PEs). Each PE consists of one Command Processor (CP), which processes the command flow, and one Operational Processor (OP), which processes the data flow. A common bus of K channels is provided to route requests from those CPs whose OPs are busy to free OPs. The productivity of the system is studied.

In [80] the multiprocessor system considered consists of several PEs, a bus, and a main (common) memory (MM). Each PE can address its own High Speed Memory (HSM), the HSMs of its neighboring PEs, and the MM. HSM is the most expensive. The access times to the three hierarchical levels of memory are given, as well as the total cost of the system and the probabilities of successful access to the required data at the first two levels. These probabilities are expressed as functions of the number of PEs and the size of a memory page. Minimization of the average access time is treated as a nonlinear programming problem, and its solution is given in the form of a system of linear equations.

The  optimal  scheduling  problem  for  abstract  multiprocessor  systems 
is  considered  in  [1].  The  case  when  all  the  jobs  have  the  same  run 
time  and  there  are  only  two  PEs  is  discussed.  A  schedule  that 
minimizes  the  number  of  jobs  uncompleted  by  a  given  time  and  the  total 
run  time  for  all  jobs  is  constructed. 

The scheduling problem for deterministic acyclic models of parallel computation is considered in [49]. Lower bounds on schedules, computable by algorithms of polynomial complexity, are proposed.

A  practical  scheduling  algorithm  for  the  distribution  of  jobs 
among  small  numbers  (2-8)  of  computational  units  is  suggested  in  [81]. 
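The review does not describe the algorithm of [81]. For a sense of what a practical heuristic for a few units can look like, here is Graham's well-known longest-processing-time-first (LPT) rule, assuming independent jobs with known durations; it is offered as a generic example, not as the method of [81].

```python
import heapq


def lpt_schedule(durations, units):
    """LPT list scheduling: take jobs in decreasing order of duration and give
    each to the currently least-loaded unit. Returns (assignment, makespan)."""
    heap = [(0, u) for u in range(units)]          # (current load, unit id)
    assignment = {u: [] for u in range(units)}
    order = sorted(range(len(durations)), key=lambda j: -durations[j])
    for job in order:
        load, u = heapq.heappop(heap)              # least-loaded unit
        assignment[u].append(job)
        heapq.heappush(heap, (load + durations[job], u))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


assignment, makespan = lpt_schedule([7, 5, 4, 3, 3, 2], units=2)
print(assignment, makespan)                        # makespan 12 (total work 24)
```

With a heap of unit loads each decision costs O(log k) for k units, so the rule stays cheap even when it is rerun frequently.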

In [60] the problem of an optimal scheduler for minimization of storage is reduced to the following. Given the matrix $p_{ij}$ ($i = 1,\ldots,m$; $j = 1,\ldots,n$) of the storage required by the i-th PE for execution of the j-th subroutine (the storage cannot be shared), find the set of permutations $e_i(j)$ such that

$\max_{j} \sum_{i=1}^{m} p_{i,e_i(j)}$

is minimal. The work of the scheduler to solve this problem is $O(mn \log_2 n)$. The following allocation problem, which requires work of $O(n^{\ldots})$, is also considered. Each of n jobs has to be allocated among k-1 devices, and there are n devices of each type. Two different jobs cannot be allocated to the same device. The cost of allocating job i to devices $i_2,\ldots,i_k$ of types $2,\ldots,k$, respectively, is $p_{i,i_2,\ldots,i_k}$. The allocation of minimum cost is required.

In [74] the system discussed consists of several PEs and several blocks of Common Memory (BCM). Connectivity of the programs is defined as the percentage of commands that require access to the same BCM. If several PEs attempt to access the same BCM in the same cycle, only one gets access; the others wait. Given the frequency of access, the average time for one instruction execution (which is the same for all the PEs) when working independently, and the average service time, with a Poisson distribution of the service time, an analysis of productivity is performed. The analysis is analytical for 2 and 3 PEs and uses statistical modelling for larger numbers.
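The statistical modelling mentioned above can be sketched as a small Monte Carlo simulation. The model here is deliberately simpler than the one in [74] and is our assumption: unit service time, uniformly chosen blocks, and one served request per block per cycle, with losers retrying in the next cycle. The function name is hypothetical.

```python
import random


def simulate_block_conflicts(pes, blocks, p_access, cycles, seed=0):
    """Monte Carlo estimate of the fraction of PE-cycles not lost to memory conflicts.

    Simplified model (assumption, not the model of [74]): in each cycle a PE
    that is not already waiting issues a request with probability p_access
    to a uniformly chosen block; each block serves exactly one of its
    pending requests per cycle and the others retry in the next cycle."""
    rng = random.Random(seed)
    pending = [None] * pes                  # block each PE is waiting for, or None
    useful = 0
    for _ in range(cycles):
        for pe in range(pes):
            if pending[pe] is None and rng.random() < p_access:
                pending[pe] = rng.randrange(blocks)
        queues = {}
        for pe, blk in enumerate(pending):
            if blk is not None:
                queues.setdefault(blk, []).append(pe)
        served = {rng.choice(waiting) for waiting in queues.values()}
        for pe in range(pes):
            if pending[pe] is None:
                useful += 1                 # computing without a memory access
            elif pe in served:
                useful += 1                 # access granted in this cycle
                pending[pe] = None
    return useful / (pes * cycles)


print(simulate_block_conflicts(pes=8, blocks=4, p_access=0.5, cycles=10_000))
```

The fixed seed makes runs reproducible; averaging over several seeds would give a rough confidence band for the estimate.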

A  method  of  simulation  of  SIMD  type  computers  is  proposed  in  [13]. 
The  structural  characteristics  of  this  class  of  computers,  the  features 
of  the  algorithms  being  implemented,  and  the  actual  operational 
conditions  (failures,  malfunctions,  use  of  control  programs  of  the 
operating  system  and  their  characteristics,  the  characteristics  of  the 
system  for  monitoring  operations  etc.)  are  taken  into  account.  The 
efficiency is characterised by the mean implementation time for algorithms, the mean probability of correct implementation of an algorithm, and the mean cost factor of computer hardware utilization. Results of practical simulation are given.

Analysis  of  the  degree  of  loading  of  a  multiprocessor  shared-use 
computer  system  is  done  based  on  the  following  model:  K  users  are 
trying  to  load  their  k  jobs  (including  empty  ones)  into  a  system 
consisting  of  M  PEs.  The  maximum  number  of  PEs  requested  by  one  user 
is  N.  The  probabilities  of  requests  for  any  number  of  PEs  in  the 
interval 0,1,...,N are equal for each user. The dispatcher decides at each time step whose jobs are to be loaded. The characteristics of the system, such as the average number of serviced users, are computed from the ratio P(M,N) of the number of favorable combinations of requested numbers of PEs (those in which no request is rejected) to the total number of such combinations. (Each combination is considered to be equally probable.)
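Under the assumption that a combination is "favorable" when the K requests, each between 0 and N, add up to at most M PEs, the ratio P(M,N) can be computed exactly without enumerating all (N+1)^K combinations, by counting partial sums. The function name is ours.

```python
from fractions import Fraction


def fraction_all_accepted(K, M, N):
    """Share of the (N+1)**K equally likely request vectors (r_1, ..., r_K),
    0 <= r_u <= N, whose total does not exceed M, i.e. no request is rejected."""
    ways = [1] + [0] * M            # ways[s]: vectors so far with total exactly s
    for _ in range(K):              # add one user at a time
        new = [0] * (M + 1)
        for s, w in enumerate(ways):
            if w:
                for r in range(N + 1):
                    if s + r <= M:
                        new[s + r] += w
        ways = new
    return Fraction(sum(ways), (N + 1) ** K)


print(fraction_all_accepted(K=2, M=2, N=2))   # 2/3
```

For K = 2, M = 2, N = 2 the favorable pairs are (0,0), (0,1), (0,2), (1,0), (1,1), (2,0), i.e. 6 of 9, which is the 2/3 printed above. The work is O(K M N) instead of (N+1)^K.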

In [25] a mathematical model is proposed which makes it possible to calculate the unproductive time spent by the processors of multicomputer installations. The idea of this model is to consider a network, each node of which corresponds to a device in the system (processor, I/O device, etc.). The completion of job processing by some device is simulated by a request moving from the node that simulates this device to the node that performs the next stage of job execution. Queueing systems are set up for the input flow to each node. The losses can be expressed in terms of the time spent in a subnet of the network and its queues. The existence and uniqueness conditions for the solution of the model are formulated and proved on the basis of qualitative methods of nonlinear functional analysis.

The problem of analyzing the functioning of a homogeneous computational system is studied in [15]. This system is a multiprocessor that works in interactive mode, accepting a stream of users' tasks, each of which requires a given number of PEs to be reserved for it. A queue is set up, and tasks that overflow it are rejected. A rigorous analytical treatment is difficult; some relations that are invariant with respect to the distribution laws are established.

In [85] the following model of a parallel computer is considered. The computer has a fixed number n of computing units (CU), each of which consists of the same variable number m of computational devices (CD) and one control block (CB). The CUs are connected through a fixed number k of channels, each of which has the same variable throughput T. The geometry of the connection (common bus, ring, radial) and the finite speed of signal propagation in the connection are explicitly taken into account. Each CB can request CDs from its own CU or from other CUs in the system. Given the required level of productivity of such a system, the cost is minimized. (The trade-off is between T and n.) The analysis of the system is based on a simplified probabilistic model.

With the aim of allocating jobs among the PEs of a multiprocessor efficiently, a measure of the quality of the state of a PE is introduced in [75]. A procedure for comparative estimation of this quality is suggested.

The problem of efficient loading of a specialized multiprocessor system is formulated in [42] as follows. There is a set of tasks and a set of types of processors. $T_{ij}$ is the time required for the solution of problem i by a processor of type j (some $T_{ij}$ can be infinite). Each processor of type j is characterized by a set of technical parameters (such as weight, power consumption, cost, etc.). The problem is to choose the number of processors of each type and to assign problems to the processors so that all constraints on the technical parameters are satisfied and some criterion (for example, the time required for the solution of all the problems) is optimal. This problem is treated as a linear programming problem.

A  model  of  an  operating  system  for  parallel  program  streams  is 
discussed  in  [59].  The  author  first  discusses  various  models  of 
representation  of  programs  for  execution  on  parallel  computers.  He 
settles  on  a  so-called  bilogical  representation  of  a  parallel  stream  of 
tasks  (suggested  in  [61]).  A  "tri-logical"  representation  is  then 
suggested.  According  to  the  author  it  is  better  suited  than  the 
bilogical  model  for  the  problem  under  consideration.  In  particular  the 
suggested model requires less memory for its representation because it represents larger fragments of the parallel stream (however, it is more complicated conceptually). The author also discusses the details of realizing this model in programming languages by supplementing sequential languages with special constructs for waiting and spawning tasks. Some details of the machine-language realization of these constructs and an algorithm for their handling by the operating system (which is called "Synchronization of fragments execution" in [59]) are presented. A programming simulation was done to estimate the complexity of this "Synchronization" algorithm. The average time consumed was about 600 instructions per fragment execution.

The method of parallel branches used for the organization of computations in homogeneous systems [64] pursues a different goal. Each branch is a sequential program, and there are special branch interaction operators, which in turn are divided into individual and group interaction operators. The individual interaction operators control the individual access of the branches to common data, whereas the group interaction operators coordinate simultaneous group access. The individual interaction operators include semaphore-changing operators, read and write operators with delay conditions intended to resolve conflicts, etc. Group operators include generalized conditional jumps in all the branches except those marked by special control variables; operators for simultaneous access by all the branches to data files; a shift operator that shifts data files between branches; an operator that simultaneously transmits data from all the branches to one branch, etc. Group operators synchronize the different branches, since they are initiated only when each unmasked branch has reached its copy of the group operator. This approach essentially combines the concepts of organizing asynchronous interactions of sequential-parallel branch processes with certain control operators characteristic of programming solutions to the synchronous parallelism of group operations on data structures.
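A group operator of this kind can be sketched with threads and barriers: each branch is an ordinary sequential function, and the group shift operator does not complete until every branch has reached it. The sketch assumes no masked branches and a cyclic shift by one branch; the class and function names are hypothetical, and `threading.Barrier` stands in for whatever synchronization mechanism [64] actually uses.

```python
import threading


class GroupShift:
    """Hypothetical group shift operator for `n` parallel branches: no branch
    continues until all have reached it, and then branch b receives the value
    deposited by branch b-1 (cyclically)."""

    def __init__(self, n):
        self.n = n
        self.slots = [None] * n
        self.written = threading.Barrier(n)   # all branches have deposited
        self.read = threading.Barrier(n)      # all branches have collected

    def shift(self, branch, value):
        self.slots[branch] = value
        self.written.wait()
        result = self.slots[(branch - 1) % self.n]
        self.read.wait()                      # slots may now be reused
        return result


def run_branches(values, rounds):
    """Run one thread per branch; each applies the group shift `rounds` times."""
    op = GroupShift(len(values))
    out = [None] * len(values)

    def branch(b):
        v = values[b]
        for _ in range(rounds):
            v = op.shift(b, v)   # ordinary sequential code could run between shifts
        out[b] = v

    threads = [threading.Thread(target=branch, args=(b,)) for b in range(len(values))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


print(run_branches([1, 2, 3, 4], 1))          # [4, 1, 2, 3]
```

The second barrier matters: without it a fast branch could start the next shift and overwrite its slot before a slower neighbour has read the previous value.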

The problem of efficient virtual memory management in a multiprocessor system is discussed in [23]. A paging procedure that is qualitatively new compared with the sequential case is suggested, and the new paging algorithm is compared in the paper with Dijkstra's "banker's" algorithm. The difference between them is that the suggested procedure allows pages to be shared between different processes, while the "banker's" algorithm does not. The authors mention that their paging algorithm was implemented in a "pseudoparallel regime" and was several times more efficient than the "serial" algorithm when used in a real data processing problem that included searches among large arrays of data on disks.
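For reference, the safety test at the heart of Dijkstra's banker's algorithm, against which [23] compares its procedure, can be sketched as follows. Resources (for instance, page frames) are held exclusively by one process, which is exactly the restriction the paging algorithm of [23] relaxes; that algorithm itself is not reproduced here.

```python
def is_safe(available, allocation, maximum):
    """Banker's safety test: can every process finish in some order if each
    may still request resources up to its declared maximum?

    available[r]     -- free units of resource type r
    allocation[p][r] -- units of r currently held by process p
    maximum[p][r]    -- declared maximum demand of p for r"""
    n = len(allocation)
    work = list(available)
    need = [[m - a for m, a in zip(maximum[p], allocation[p])] for p in range(n)]
    finished = [False] * n
    progress = True
    while progress:
        progress = False
        for p in range(n):
            if not finished[p] and all(nd <= w for nd, w in zip(need[p], work)):
                # p can run to completion and then releases everything it holds
                work = [w + a for w, a in zip(work, allocation[p])]
                finished[p] = True
                progress = True
    return all(finished)


print(is_safe([3, 3, 2],
              [[0, 1, 0], [2, 0, 0], [3, 0, 2], [2, 1, 1], [0, 0, 2]],
              [[7, 5, 3], [3, 2, 2], [9, 0, 2], [2, 2, 2], [4, 3, 3]]))   # True
```

A request is granted only if the state after granting it still passes `is_safe`; because held units are never shared, two processes reading the same page would each have to count it against their own allocation.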